Chapter 4. Clustering and classification

I completed all the DataCamp exercises prior to proceeding to this exercise. I also looked up some details how to use the RMarkdown more productively and hopefully this report will be clearer than the previous ones.

Today we are looking at the Boston crime

Boston Data set information:

Variables Description
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox nitrogen oxides concentration (parts per 10 million).
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
ptratio pupil-teacher ratio by town.
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat lower status of the population (percent).
medv median value of owner-occupied homes divided by $1000s.
# To empty the memory after the excercise before this
# rm(list=ls())
# Load data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # to set working directory to source file

# load the data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

All are numeric variables except chas and rad. The data has demographical data of boston including tax and other information possibly linked to crime rates within the city.

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method=loess, fill="red", color="red", ...) +
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}
g = ggpairs(Boston,columns = c(1:14), lower = list(continuous = my_fn))
g

  1. It seems that indus, nox, rad, tax and lstat have positive correlation with crime, and dis, black and medv have negative correlation with crime.
  2. Many of the variables seems to have non-linear relationships (red lines in the picture), however the variables with high correlation seems to be more linear.
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

More focused figure on correlation, since they are hardly visible with this many variables.

# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crim'
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest= TRUE, label = c("low","med_low","med_high", "high"))

# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)

# number of rows in the Boston dataset
n <-nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- Boston_scaled[ind,]

# create test set
test <- Boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <-test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(train)
##        zn               indus                chas         
##  Min.   :-0.48724   Min.   :-1.556302   Min.   :-0.27233  
##  1st Qu.:-0.48724   1st Qu.:-0.875579   1st Qu.:-0.27233  
##  Median :-0.48724   Median :-0.210890   Median :-0.27233  
##  Mean   : 0.01699   Mean   :-0.001508   Mean   : 0.02977  
##  3rd Qu.: 0.04872   3rd Qu.: 1.014995   3rd Qu.:-0.27233  
##  Max.   : 3.80047   Max.   : 2.420170   Max.   : 3.66477  
##       nox                   rm                 age           
##  Min.   :-1.4299136   Min.   :-3.876413   Min.   :-2.333128  
##  1st Qu.:-0.8776070   1st Qu.:-0.576252   1st Qu.:-0.834844  
##  Median :-0.1440749   Median :-0.113340   Median : 0.317068  
##  Mean   :-0.0007691   Mean   :-0.006532   Mean   :-0.003215  
##  3rd Qu.: 0.5980871   3rd Qu.: 0.498658   3rd Qu.: 0.913895  
##  Max.   : 2.7296452   Max.   : 3.551530   Max.   : 1.116390  
##       dis                 rad                tax         
##  Min.   :-1.265817   Min.   :-0.98187   Min.   :-1.3127  
##  1st Qu.:-0.821738   1st Qu.:-0.63733   1st Qu.:-0.7698  
##  Median :-0.263945   Median :-0.52248   Median :-0.4642  
##  Mean   : 0.005829   Mean   :-0.02017   Mean   :-0.0127  
##  3rd Qu.: 0.688655   3rd Qu.: 1.65960   3rd Qu.: 1.5294  
##  Max.   : 3.956602   Max.   : 1.65960   Max.   : 1.7964  
##     ptratio             black               lstat         
##  Min.   :-2.70470   Min.   :-3.903331   Min.   :-1.52961  
##  1st Qu.:-0.67232   1st Qu.: 0.204020   1st Qu.:-0.81403  
##  Median : 0.18221   Median : 0.381193   Median :-0.22518  
##  Mean   :-0.01833   Mean   : 0.009689   Mean   :-0.01128  
##  3rd Qu.: 0.80578   3rd Qu.: 0.434619   3rd Qu.: 0.62203  
##  Max.   : 1.63721   Max.   : 0.440616   Max.   : 3.54526  
##       medv               crime    
##  Min.   :-1.90634   low     :105  
##  1st Qu.:-0.62605   med_low : 99  
##  Median :-0.12317   med_high:103  
##  Mean   : 0.01877   high    : 97  
##  3rd Qu.: 0.42320                 
##  Max.   : 2.98650

Now we have generated the test dataset with the new categorial variable “crime”, which is based on the quantiles of the original numeric variable.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2450495 0.2549505 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.98181218 -0.8874203 -0.08484810 -0.8681635  0.44327317
## med_low  -0.09246736 -0.2564451  0.04582045 -0.5410539 -0.14233658
## med_high -0.38649973  0.1872612  0.22458650  0.4101587  0.09019894
## high     -0.48724019  1.0172187 -0.06938576  1.0532418 -0.45754612
##                 age        dis        rad        tax     ptratio
## low      -0.9154949  0.8784211 -0.6920200 -0.7326859 -0.47479927
## med_low  -0.2613505  0.3669684 -0.5480059 -0.4597182 -0.04944639
## med_high  0.4062608 -0.4030603 -0.3886824 -0.2862109 -0.27454203
## high      0.8129587 -0.8731332  1.6371072  1.5133254  0.77958792
##                black       lstat        medv
## low       0.38334881 -0.77545046  0.53935749
## med_low   0.30759039 -0.11654502 -0.01630709
## med_high  0.08955669  0.02991653  0.15931666
## high     -0.78364082  0.87959845 -0.65818815
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.108786587  0.69635890 -0.88287401
## indus   -0.025195867 -0.11859408  0.33131889
## chas    -0.078688136 -0.07932550  0.11513358
## nox      0.421648091 -0.74703377 -1.43562384
## rm      -0.081555248 -0.08263213 -0.19402192
## age      0.258229188 -0.33591952  0.15274859
## dis     -0.126865225 -0.23841496  0.37841106
## rad      2.987922913  0.96187096 -0.06539660
## tax      0.004273758 -0.08664326  0.55862554
## ptratio  0.107441591  0.03939024 -0.26014949
## black   -0.136854750 -0.01716850  0.05077511
## lstat    0.236880033 -0.20362552  0.28345251
## medv     0.209732025 -0.31553234 -0.18743418
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9444 0.0412 0.0144
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g  + theme_tufte()  + geom_rangeframe()

  1. We get the points seperated only by the 1st LD and rad parameter in the high group.
  2. From the LDA summary, we can see that how the first linear discriminant explain 94.72 % of the between-group variance in the Boston dataset.
  3. Also the rad (= index of accessibility to radial highways) has relatively high value in the first discriminant, so it proves to be uselful in the classification. rad had negative correlation with crime.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       8        0    0
##   med_low    4      19        4    0
##   med_high   1       9       13    0
##   high       0       0        0   30

In the high -crime quartile, we could predict all cases, but for other classes we didn’t do so well. However, LDA seems to be able to predict our cases with quite good accuracy.

# k-means clustering
BostonS <- scale(Boston) # standardize variables

wss <- (nrow(BostonS)-1)*sum(apply(BostonS,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS, 
    centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

km <-kmeans(BostonS, centers = 4)

# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)

We see similar results than in LDA

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Now we can see the 3D picture of the clusters!